Abstract. We present an efficient algorithm for computing the LZ78
factorization of a text, where the text is represented as a straight line
program (SLP), which is a context free grammar in the Chomsky normal
form that generates a single string. Given an SLP of size $n$ representing
a text $S$ of length $N$, our algorithm computes the LZ78 factorization
of $S$ in $O(n\sqrt{N} + m\log N)$ time and $O(n\sqrt{N} + m)$ space, where
$m$ is the number of resulting LZ78 factors. We also show how to improve the
algorithm so that the $n\sqrt{N}$ term in the time and space complexities
becomes either $nL$, where $L$ is the length of the longest LZ78 factor, or
$(N - \alpha)$ where $\alpha \ge 0$ is a quantity which depends on the amount
of redundancy that the SLP captures with respect to substrings of $S$ of a
certain length. Since $m = O(N/\log_\sigma N)$ where $\sigma$ is the alphabet
size, the latter is asymptotically at least as fast as a linear time algorithm
which runs on the uncompressed string when $\sigma$ is constant, and can be
more efficient when the text is compressible, i.e. when $m$ and $n$ are small.

1 Introduction 

Large scale textual data are usually stored in compressed form and are
decompressed later when they are used. In order to reduce the computational
resources required to handle and process the cumbersome uncompressed string,
the compressed string processing (CSP) approach has been gaining attention.
The aim of CSP is to process text given in compressed form without explicitly
decompressing the entire text, therefore allowing space efficient, as well as
time efficient processing of the text when it is sufficiently compressed.

Many CSP algorithms work on a representation of the compressed text called
straight line programs (SLPs). An SLP is a context free grammar in the Chomsky
normal form that derives a single string. SLPs can efficiently model the outputs
of many different types of compression algorithms (e.g.: grammar based [22,17],
dictionary based [28,29]), and hence, an algorithm that works on an SLP can
be applied to texts compressed by various compression algorithms. On the other
hand, there are many CSP algorithms which make use of specific properties that
are implicit in the compressed representation $C(S)$ of text $S$ obtained by
using a certain compression algorithm $C$ [4,9,10,11]. Such CSP algorithms
cannot be applied to representations produced by any arbitrary compression
algorithm. To overcome this problem, we consider the problem of computing the
compressed representation $C(S)$ from an arbitrary SLP representing $S$,
without completely decompressing the SLP.

In this paper, we focus on the well known LZ78 compression algorithm [29].
LZ78 compresses a given text based on a dynamic dictionary which is
constructed by partitioning the input string, the process of which is called
LZ78 factorization. Other than its obvious use for compression, the LZ78
factorization is an important concept used in various string processing
algorithms and applications [7,19,18,20]. The contribution of this paper is an
$O(n\sqrt{N} + m\log N)$ time and $O(n\sqrt{N} + m)$ space algorithm to compute
the LZ78 factorization of a string given as an SLP, where $N$ is the length of
the string, $n$ is the size of the SLP, and $m$ is the number of LZ78 factors.

We further show how to improve the $n\sqrt{N}$ term in the time and space
complexities in two ways. An application of doubling search enables the term to
be reduced to $nL$, where $L$ is the length of the longest LZ78 factor. Also,
by applying the recent techniques of [13], the term can be reduced to
$N - \alpha$, where $\alpha \ge 0$ is a quantity which depends on the amount of
redundancy that the SLP captures with respect to substrings of $S$ of a certain
length. Since it is known that $m = O(N/\log_\sigma N)$ [29], where $\sigma$ is
the alphabet size, our approach is guaranteed to be asymptotically at least as
fast as a linear time algorithm which runs on the uncompressed string if
$\sigma$ is considered constant, and can be even more efficient when the text
is compressible, i.e. when $m$ and $n$ are small.

As a byproduct of the above results, we also obtain an efficient algorithm 
which converts a given LZ77 factorization of a string [28] to the corresponding 
LZ78 factorization without explicit decompression. We conclude the paper by 
mentioning several other interesting potential applications of our algorithm. 

Related Work 

An efficient algorithm for computing the LZ78 factorization was presented in [14].
Their algorithm requires only $O(N(\log\sigma + \log\log_\sigma N)/\log_\sigma N)$
bits of working space and runs in
$O(N(\log\log N)^2/(\log_\sigma N \log\log\log N))$ worst-case time, which is
sub-linear when $\sigma = 2^{o(\log N \log\log\log N/(\log\log N)^2)}$. However,
their input assumes the uncompressed text and it is unknown how to apply their
algorithm without completely decompressing the SLP.

2 Preliminaries 
2.1 Strings 

Let $\Sigma$ be a finite alphabet and $\sigma = |\Sigma|$. An element of
$\Sigma^*$ is called a string. The length of a string $S$ is denoted by $|S|$.
The empty string $\varepsilon$ is a string of length 0, namely,
$|\varepsilon| = 0$. For a string $S = XYZ$, $X$, $Y$ and $Z$ are called a
prefix, substring, and suffix of $S$, respectively. The set of all substrings
of a string $S$ is denoted by $Substr(S)$. The $i$-th character of a string $S$
is denoted by $S[i]$ for $1 \le i \le |S|$, and the substring of a string $S$
that begins at position $i$ and ends at position $j$ is denoted by $S[i : j]$
for $1 \le i \le j \le |S|$. For convenience, let $S[i : j] = \varepsilon$ if
$j < i$. For a string $S$ and integer $q \ge 0$, let $pre(S, q)$ and
$suf(S, q)$ represent respectively the length-$q$ prefix and suffix of $S$,
that is, $pre(S, q) = S[1 : \min\{q, |S|\}]$ and
$suf(S, q) = S[\max\{1, |S| - q + 1\} : |S|]$. We also assume that the last
character of the string is a special character '$' that does not occur anywhere
else in the string.

Our model of computation is the word RAM: We shall assume that the
computer word size is at least $\log |S|$, and hence, standard operations on
values representing lengths and positions of string $S$ can be performed in
constant time. Space complexities will be determined by the number of computer
words (not bits).

2.2 Straight Line Programs 

A straight line program (SLP) is a set of assignments
$\mathcal{T} = \{X_1 \to expr_1, X_2 \to expr_2, \ldots, X_n \to expr_n\}$,
where each $X_i$ is a distinct non-terminal variable and each $expr_i$ is an
expression that can be either $expr_i = a$ ($a \in \Sigma$), or
$expr_i = X_{\ell(i)} X_{r(i)}$ ($i > \ell(i), r(i)$). An SLP is essentially a
context free grammar in the Chomsky normal form, that derives a single string.
Let $val(X_i)$ represent the string derived from variable $X_i$. To ease
notation, we sometimes associate $val(X_i)$ with $X_i$ and denote $|val(X_i)|$
as $|X_i|$. An SLP $\mathcal{T}$ represents the string $T = val(X_n)$. The size
of the program $\mathcal{T}$ is the number $n$ of assignments in $\mathcal{T}$.

The derivation tree of SLP $\mathcal{T}$ is a labeled ordered binary tree where
each internal node is labeled with a non-terminal variable in
$\{X_1, \ldots, X_n\}$, and each leaf is labeled with a terminal character in
$\Sigma$. The root node has label $X_n$. Let $\mathcal{V}$ denote the set of
internal nodes in the derivation tree. For any internal node
$v \in \mathcal{V}$, let $\langle v \rangle$ denote the index of its label
$X_{\langle v \rangle}$. Node $v$ has a single child which is a leaf labeled
with $c$ when $(X_{\langle v \rangle} \to c) \in \mathcal{T}$ for some
$c \in \Sigma$, or $v$ has a left-child and right-child respectively denoted
$\ell(v)$ and $r(v)$, when
$(X_{\langle v \rangle} \to X_{\langle \ell(v) \rangle} X_{\langle r(v) \rangle}) \in \mathcal{T}$.
Each node $v$ of the tree derives $val(X_{\langle v \rangle})$, a substring of
$T$, whose corresponding interval $itv(v) = [b : e]$, with
$T[b : e] = val(X_{\langle v \rangle})$, can be defined recursively as follows.
If $v$ is the root node, then $itv(v) = [1 : |T|]$. Otherwise, if
$(X_{\langle v \rangle} \to X_{\langle \ell(v) \rangle} X_{\langle r(v) \rangle}) \in \mathcal{T}$,
then $itv(\ell(v)) = [b_v : b_v + |X_{\langle \ell(v) \rangle}| - 1]$ and
$itv(r(v)) = [b_v + |X_{\langle \ell(v) \rangle}| : e_v]$, where
$[b_v : e_v] = itv(v)$. Let $vOcc(X_i)$ denote the number of times a variable
$X_i$ occurs in the derivation tree, i.e.,
$vOcc(X_i) = |\{v \mid X_{\langle v \rangle} = X_i\}|$.

For any interval $[b : e]$ of $T$ ($1 \le b < e \le |T|$), let
$\xi_{\mathcal{T}}(b, e)$ denote the deepest node $v$ in the derivation tree
which derives an interval containing $[b : e]$, that is,
$itv(v) \supseteq [b : e]$, and no proper descendant of $v$ satisfies this
condition. We say that node $v$ stabs interval $[b : e]$, and
$X_{\langle v \rangle}$ is called the variable that stabs the interval. We have
$(X_{\langle v \rangle} \to X_{\langle \ell(v) \rangle} X_{\langle r(v) \rangle}) \in \mathcal{T}$,
$b \in itv(\ell(v))$, and $e \in itv(r(v))$. When it is not confusing, we will
sometimes use $\xi_{\mathcal{T}}(b, e)$ to denote the variable
$X_{\langle \xi_{\mathcal{T}}(b, e) \rangle}$.

SLPs can be efficiently pre-processed to hold various information. $|X_i|$ and
$vOcc(X_i)$ can be computed for all variables $X_i$ ($1 \le i \le n$) in a
total of $O(n)$ time by a simple dynamic programming algorithm.
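
The dynamic programming mentioned above can be sketched as follows. This is
only an illustration: the dictionary encoding of the SLP (`RULES`) and the
function name are assumptions made for the example, and the rules are those
of the SLP shown in Fig. 1. Lengths are accumulated bottom-up and occurrence
counts top-down.

```python
# Hypothetical encoding of the SLP of Fig. 1: a terminal rule is a character,
# a binary rule X_i -> X_l X_r is the pair (l, r) with l, r < i.
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def lengths_and_vocc(rules, root):
    """Return the dictionaries |X_i| and vOcc(X_i) for every variable."""
    order = sorted(rules)
    length = {}
    for i in order:                      # bottom-up: children precede parents
        rhs = rules[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    vocc = {i: 0 for i in order}
    vocc[root] = 1                       # the root occurs exactly once
    for i in reversed(order):            # top-down: parents precede children
        rhs = rules[i]
        if not isinstance(rhs, str):
            vocc[rhs[0]] += vocc[i]
            vocc[rhs[1]] += vocc[i]
    return length, vocc


length, vocc = lengths_and_vocc(RULES, 7)
print(length[7], vocc[5], vocc[4])       # |X_7|, vOcc(X_5), vOcc(X_4)
```

Each rule is visited once in each pass, so apart from ordering the rules by
index the work is linear in $n$.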
Fig. 1. The derivation tree of SLP $\mathcal{T} = \{X_1 \to a, X_2 \to b,
X_3 \to X_1 X_2, X_4 \to X_1 X_3, X_5 \to X_3 X_4, X_6 \to X_4 X_5,
X_7 \to X_6 X_5\}$. $T = val(X_7) = aababaababaab$.



2.3 LZ78 Encoding 

Definition 1 (LZ78 factorization). The LZ78-factorization of a string $S$ is
the factorization $f_1 \cdots f_m$ of $S$, where each LZ78-factor
$f_i \in \Sigma^+$ ($1 \le i \le m$) is the longest prefix of
$f_i \cdots f_m$, such that
$f_i \in \{f_j c \mid 1 \le j < i, c \in \Sigma\} \cup \Sigma$.

For a given string $S$, let $m$ denote the number of factors in its LZ78
factorization. The LZ78 factorization of the string can be encoded by a
sequence of pairs, where the pair for factor $f_i$ consists of the ID $j$ of
the previous factor $f_j$ ($j = 0$ and $f_0 = \varepsilon$ when there is none)
and the new character $S[|f_1 \cdots f_i|]$. Regarding this pair as a parent
and edge label, the factors can also be represented as a trie. (See Fig. 2.)

By using this trie, the LZ78 factorization of a string of length $N$ can be
easily computed incrementally in $O(N \log\sigma)$ time and $O(m)$ space:
Start from an empty tree with only the root. For $1 \le i \le m$, to calculate
$f_i$, let $v$ be the node of the trie reached by traversing the tree with
$S[p : q]$, where $p = |f_0 \cdots f_{i-1}| + 1$, and $q \ge p - 1$ is the
smallest position such that $v$ does not have an outgoing edge labeled with
$S[q + 1]$ (when $q = p - 1$, $S[p : q] = \varepsilon$ and $v$ is the root).
Naturally, $v$ represents the longest previously used LZ78-factor that is a
prefix of $S[p : |S|]$. Then, we can insert an edge labeled with $S[q + 1]$ to
a new node representing factor $f_i$, branching from $v$. The update for each
factor $f_i$ can be done in $O(|f_i| \log\sigma)$ time for the traversal and in
$O(\log\sigma)$ time for the insertion, with a total of $O(N \log\sigma)$ time
for all the factors. Since each node of the trie except the root corresponds to
an LZ78 factor, the size of the trie is $O(m)$.




Fig. 2. The LZ78 dictionary for the string aaabaabbbaaaaaaaba$. Each node
numbered $i$ represents the factor $f_i$ of the LZ78 factorization, where $f_i$
is the path label from the root to the node, e.g.: $f_2 = aa$, $f_4 = aab$.
Example 1. The LZ78 factorization of string aaabaabbbaaaaaaaba$ is a, aa, b,
aab, bb, aaa, aaaa, ba, $, and can be represented as (0,a), (1,a), (0,b), (2,b),
(3,b), (2,a), (6,a), (3,a), (0,$).
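
The incremental trie construction described above can be sketched in a few
lines. This is a minimal illustration rather than a tuned implementation: the
function name is hypothetical, and the trie is stored in a hash table keyed by
(node, character), which gives expected constant time per step instead of the
$O(\log\sigma)$ branching cost assumed in the text.

```python
def lz78_factorize(s):
    """Return the LZ78 factorization of s as (parent factor ID, character) pairs.

    Assumes s ends with a unique end marker such as '$'; without one, a
    trailing piece that repeats an earlier factor is not reported.
    """
    trie = {}                  # (node ID, character) -> child node ID; root is 0
    pairs = []
    node = 0
    for c in s:
        child = trie.get((node, c))
        if child is not None:
            node = child       # still inside a previously used factor
        else:
            pairs.append((node, c))
            trie[(node, c)] = len(pairs)   # the new factor receives the next ID
            node = 0           # restart from the root for the next factor
    return pairs


print(lz78_factorize("aaabaabbbaaaaaaaba$"))
```

On the string of Example 1 the returned pairs coincide with the encoding
listed there.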

2.4 Suffix Trees 

We give the definition of a very important and well known string index structure,
the suffix tree. To assure property 3 for the sake of presentation, we assume that
the string ends with a unique symbol that does not occur elsewhere in the string.

Definition 2 (Suffix Trees [26]). For any string $S$, its suffix tree, denoted
$ST(S)$, is a labeled rooted tree which satisfies the following:

1. each edge is labeled with an element in $\Sigma^+$;

2. there exist exactly $n$ leaves, where $n = |S|$;

3. for each string $s \in Suffix(S)$, there is a unique path from the root to a
leaf which spells out $s$;

4. each internal node has at least two children;

5. the labels $x$ and $y$ of any two distinct out-going edges from the same node
begin with different symbols in $\Sigma$.

Since any substring of S is a prefix of some suffix of S, positions in the suffix 
tree of S correspond to a substring of S that is represented by the string spelled 
out on the path from the root to the position. We can also define a generalized 
suffix tree of a set of strings, which is simply the suffix tree that contains all 
suffixes of all the strings in the set. 

It is well known that suffix trees can be represented and constructed in linear
time [26,21,25], even independently of the alphabet size for integer alphabets [5].
Generalized suffix trees for a set of strings $\mathcal{S} = \{S_1, \ldots, S_k\}$
can be constructed in linear time in the total length of the strings, by simply
constructing the suffix tree of the string $S_1 \$_1 \cdots S_k \$_k$, and
pruning the tree below the first occurrence of any $\$_i$, where $\$_i$
($1 \le i \le k$) are unique characters that do not occur elsewhere in strings
of $\mathcal{S}$.

3 Algorithm 

We describe our algorithm for computing the LZ78 factorization of a string
given as an SLP in two steps. The basic structure of the algorithm follows the
simple LZ78 factorization algorithm for uncompressed strings that uses a trie
as mentioned in Section 2.3. Although the space complexity of the trie is only
$O(m)$, we need some way to accelerate the traversal of the trie in order to
achieve the desired time bounds.






3.1 Partial Decompression 

We use the following property of LZ78 factors which is straightforward from its 
definition. 

Lemma 1. For any string $S$ of length $N$ and its LZ78-factorization
$f_1 \cdots f_m$, $m \ge c_N$ and $|f_i| \le c_N$ for all $1 \le i \le m$,
where $c_N = \sqrt{2N + 1/4} - 1/2$.

Proof. Since a factor can be at most 1 character longer than a previously used
factor, $|f_i| \le i$. Therefore, $N = \sum_{i=1}^{m} |f_i| \le \sum_{i=1}^{m} i$,
and thus $m \ge \sqrt{2N + 1/4} - 1/2$. For any factor of length
$x = |f_{i_x}|$, there exist distinct factors $f_{i_1}, \ldots, f_{i_{x-1}}$
whose lengths are respectively $1, \ldots, x - 1$. Therefore,
$N = \sum_{i=1}^{m} |f_i| \ge \sum_{i=1}^{x} i$ and
$x \le \sqrt{2N + 1/4} - 1/2$. □
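
For completeness, the step from the sum to the closed form of $c_N$ in the
proof above can be written out as follows; it is only the solution of a
quadratic inequality, taking the non-negative root.

```latex
\[
  N \;\le\; \sum_{i=1}^{m} i \;=\; \frac{m(m+1)}{2}
  \;\Longleftrightarrow\; m^2 + m - 2N \;\ge\; 0
  \;\Longleftrightarrow\; m \;\ge\; \frac{-1 + \sqrt{1 + 8N}}{2}
  \;=\; \sqrt{2N + \tfrac{1}{4}} - \tfrac{1}{2} \;=\; c_N .
\]
```

The bound on the factor length $x$ follows in the same way from
$x(x+1)/2 \le N$ with the inequality reversed. For instance, for $N = 13$ we
have $c_N = \sqrt{26.25} - 1/2 \approx 4.62$, so no LZ78-factor of a string of
length 13 is longer than 4.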

The lemma states that the length of an LZ78-factor is bounded by $c_N$.
To utilize this property, we use ideas similar to those developed in [12,13]
for counting the frequencies of all substrings of a certain length in a string
represented by an SLP. For simplicity, assume $c_N \ge 2$. For each variable
$X_i \to X_{\ell(i)} X_{r(i)}$, any length $c_N$ substring that is stabbed by
$X_i$ is a substring of
$t_i = suf(val(X_{\ell(i)}), c_N - 1)\, pre(val(X_{r(i)}), c_N - 1)$. On the
other hand, all length $c_N$ substrings are stabbed by some variable. This
means that if we consider the set of strings consisting of $t_i$ for all
variables such that $|X_i| \ge c_N$, any length $c_N$ substring of $S$ is a
substring of at least one of the strings. We can compute all such strings
$T_S = \{t_i \mid |X_i| \ge c_N\}$ where
$(X_i \to X_{\ell(i)} X_{r(i)}) \in \mathcal{T}$ in time linear in the total
length, i.e. $O(n c_N)$ time by a straightforward dynamic programming [12].

All length $c_N$ substrings of $S$ occur as substrings of strings in $T_S$, and
by Lemma 1 it follows that $T_S$ contains all LZ78-factors of S as substrings.
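
The partial decompression can be sketched as follows, under a few assumptions
made for the example: the SLP is the one of Fig. 1 in a hypothetical
dictionary encoding, the window length is an integer parameter `k` (here
$k = 4$, the largest integer not exceeding $c_N$ for $N = 13$), and all
function names are illustrative. Each variable keeps only its prefix and
suffix of length at most $k - 1$, so every step costs $O(k)$.

```python
from math import isqrt

# Hypothetical encoding of the SLP of Fig. 1 (character or pair of indices).
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def max_factor_length(n_chars):
    """Largest integer x with x(x+1)/2 <= N, i.e. the floor of c_N."""
    return (isqrt(8 * n_chars + 1) - 1) // 2


def partial_decompress(rules, k):
    """Return {i: t_i} for every binary variable X_i with |X_i| >= k (k >= 2)."""
    length, pre, suf, t = {}, {}, {}, {}
    for i in sorted(rules):                       # children precede parents
        rhs = rules[i]
        if isinstance(rhs, str):
            length[i], pre[i], suf[i] = 1, rhs, rhs
            continue
        l, r = rhs
        length[i] = length[l] + length[r]
        pre[i] = (pre[l] + pre[r])[:k - 1]        # prefix of length <= k-1
        suf[i] = (suf[l] + suf[r])[-(k - 1):]     # suffix of length <= k-1
        if length[i] >= k:
            t[i] = suf[l] + pre[r]                # covers all stabbed windows
    return t


k = max_factor_length(13)
print(k, partial_decompress(RULES, k))
```

For the SLP of Fig. 1 this yields the three strings $t_5$, $t_6$, $t_7$ that
appear (with end markers appended) in Fig. 3.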

3.2 Finding the Next Factor 

In the previous subsection, we described how to partially decompress a given
SLP of size $n$ representing a string $S$ of length $N$, to obtain a set of
strings $T_S$ with total length $O(n\sqrt{N})$, such that any LZ78-factor of
$S$ is a substring of at least one of the strings in $T_S$. We next describe
how to identify these substrings.

We make the following key observation: since the LZ78-trie of a string $S$ is
a trie composed by substrings of $S$, it can be superimposed on a suffix tree of
$S$, and be completely contained in it, with the exception that some nodes of
the trie may correspond to implicit nodes of the suffix tree (in the middle of
an edge of the suffix tree). Furthermore, this superimposition can also be done
to the generalized suffix tree constructed for $T_S$. (See Fig. 3.)

Suppose we have computed the LZ78 factorization $f_1 \cdots f_{i-1}$, up to
position $p - 1 = |f_1 \cdots f_{i-1}|$, and wish to calculate the next
LZ78-factor starting at position $p$. Let
$v = \xi_{\mathcal{T}}(p, p + c_N - 1)$, let $X_j = X_{\langle v \rangle}$ be
the variable that stabs the interval $[p : p + c_N - 1]$, let $q$ be the offset
of $p$ in $t_j$, and let $w$ be the leaf of the generalized suffix tree that
corresponds to the suffix $t_j[q : |t_j|]$. The longest previously used factor
that is a prefix of $S[p : |S|]$ is the longest common prefix between
$t_j[q : |t_j|]$ and all possible paths on the LZ78-trie built so far. If we
consider the suffix tree as a semi-dynamic tree, where nodes corresponding to
the superimposed LZ78-trie are dynamically added and marked, the node $x$ we
seek is the nearest marked ancestor of $w$.

Fig. 3. The LZ78-trie of string $S = aababaababaab$, superimposed on the
generalized suffix tree of $T_S = \{t_5, t_6, t_7\} = \{abaab\$_5, aababa\$_6,
aababa\$_7\}$ for the SLP of Fig. 1. Here, $\$_5, \$_6, \$_7$ are end markers
of each string in $T_S$, introduced so that each position in a string of $T_S$
corresponds to a leaf of the suffix tree. The subtree consisting of the dark
nodes is the LZ78-trie, derived from the LZ78-factorization a, ab, aba, abab,
aa, b of $S$. Since any length $\lfloor c_N \rfloor = 4$ substring of $S$ is a
substring of at least one string in $T_S$, any LZ78-factor of $S$ is a
substring of some string of $T_S$, and the generalized suffix tree of $T_S$
completely includes the LZ78-trie.

The generalized suffix tree for $T_S$ can be computed in $O(n\sqrt{N})$ time.
We next describe how to obtain the values $v$, $q$ (and therefore $w$), and $x$
as well as the computational complexities involved.

A naive algorithm for obtaining $v$ and $q$ would be to traverse down the
derivation tree of the SLP from the root, checking the decompressed lengths of
the left and right child of each variable to determine which child to go down,
in order to find the variables that correspond to positions $p$ and
$p + c_N - 1$. By doing the search in parallel, we can find $v$ as the node at
which the search for each position diverges, i.e., the lowest common ancestor
of leaves in the derivation tree corresponding to positions $p$ and
$p + c_N - 1$. This traversal requires $O(h)$ time, where $h$ is the height of
the SLP, which can be as large as $O(n)$. To do this more efficiently, we can
apply the algorithm of [5], which allows random access to arbitrary positions
of the SLP in $O(\log N)$ time, with $O(n)$ time and space of preprocessing.
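
The naive $O(h)$ traversal can be sketched as follows; it is not the
$O(\log N)$ random access structure of [5]. The encoding of the SLP, the
function names, and the integer window length `k` are assumptions made for the
example. The function returns the index $j$ of the stabbing variable together
with the 1-based offset $q$ of position $b$ inside $t_j$.

```python
# Hypothetical encoding of the SLP of Fig. 1 (character or pair of indices).
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def lengths(rules):
    """Decompressed length |X_i| of every variable, computed bottom-up."""
    length = {}
    for i in sorted(rules):
        rhs = rules[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length


def stab(rules, length, root, b, e, k):
    """Variable stabbing [b : e] and the offset of b in t_j.

    Positions are 1-based; requires b < e and e - b + 1 <= k.
    """
    j, start = root, 1                  # current variable, first position it covers
    while True:
        l, r = rules[j]
        split = start + length[l]       # first position covered by the right child
        if e < split:                   # whole interval lies in the left child
            j = l
        elif b >= split:                # whole interval lies in the right child
            j, start = r, split
        else:                           # the two searches diverge here
            q = min(k - 1, length[l]) + 1 - (split - b)
            return j, q


LENGTH = lengths(RULES)
print(stab(RULES, LENGTH, 7, 9, 12, 4))
```

For example, the interval $[9 : 12]$ of aababaababaab is stabbed by $X_5$ at
offset 1 of $t_5 = abaab$, whose first four characters are indeed the
substring at positions 9 to 12.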
Theorem 1 ([5]). For an SLP of size $n$ representing a string of length $N$,
random access can be supported in time $O(\log N)$ after $O(n)$ preprocessing
time and space in the RAM model.

Their algorithm basically constructs data structures in order to simulate the
traversal of the SLP from the root, but reduces the time complexity from $O(h)$
to $O(\log N)$. Therefore, by running two random access operations for positions
$p$ and $p + c_N - 1$ in parallel until they first diverge, we can obtain $v$ in
$O(\log N)$ time. We note that this technique is the same as the first part of
their algorithm for decompressing a substring $S[i : j]$ of length
$m = j - i + 1$ in $O(m + \log N)$ time. The offset of $p$ from the beginning
of $X_{\langle v \rangle}$ can be obtained as a byproduct of the search for
position $p$, and therefore, $q$ can also be computed in $O(\log N)$ time.

For obtaining x, we use a data structure that maintains a rooted dynamic 
tree with marked/unmarked nodes such that the nearest marked ancestor in the 
path from a given node to the root can be found very efficiently. The following 
result allows us to find x - the nearest marked ancestor of w - in amortized 
constant time. 

Lemma 2 ([27,1]). A semi-dynamic rooted tree can be maintained in linear
space so that the following operations are supported in amortized $O(1)$ time:
1) find the nearest marked ancestor of any node; 2) insert an unmarked node; 3)
mark an unmarked node.

For inserting the new node for the new LZ78-factor, we simply move down the
edge of the suffix tree if $x$ was an implicit node and has only one child. When
$x$ is branching, we can move down the correct edge of the suffix tree using
level ancestor queries of the leaf $w$, therefore not requiring an
$O(\log\sigma)$ factor.

Lemma 3 (Level ancestor query [3,2]). Given a static rooted tree, we can
preprocess the tree in linear time and space so that the $\ell$-th node in the
path from any node to the root can be found in $O(1)$ time for any integer
$\ell \ge 0$, if such exists.

Technically, our suffix tree is semi-dynamic in that new nodes are created since
the LZ78-trie is superimposed. However, since we are only interested in level
ancestor queries at branching nodes, we only need to answer them for the original
suffix tree. Therefore, we can preprocess the tree in $O(n\sqrt{N})$ time and
space to answer the level ancestor queries in $O(1)$ time.
The main result of this section follows: 

Theorem 2. Given an SLP of size $n$ representing a string $S$ of length $N$, we
can compute the LZ78 factorization of $S$ in $O(n\sqrt{N} + m\log N)$ time and
$O(n\sqrt{N} + m)$ space, where $m$ is the size of the LZ78 factorization.
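
To make the overall flow concrete, the following is a simplified end-to-end
sketch that factorizes directly from the SLP without ever materializing $S$.
It deliberately swaps in simpler components, so it does not attain the bounds
of Theorem 2: the stabbing variable is found by the naive $O(h)$ traversal
instead of the structure of [5], and the LZ78-trie is a plain hash-table trie
walked character by character instead of being superimposed on a generalized
suffix tree with nearest marked ancestor queries. The encoding of the SLP, the
function names, and the choice to keep $t_i$ for every binary variable (so that
short windows near the end of the text are also covered) are assumptions of the
example.

```python
from math import isqrt

# Hypothetical encoding of the SLP of Fig. 1 (character or pair of indices).
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def lz78_from_slp(rules, root):
    order = sorted(rules)
    length, pre, suf, t = {}, {}, {}, {}
    for i in order:
        rhs = rules[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    n_chars = length[root]
    k = max(2, (isqrt(8 * n_chars + 1) - 1) // 2)   # floor of c_N, at least 2

    # Partial decompression: t[i] = suf(val(X_l), k-1) + pre(val(X_r), k-1).
    for i in order:
        rhs = rules[i]
        if isinstance(rhs, str):
            pre[i] = suf[i] = rhs
        else:
            l, r = rhs
            pre[i] = (pre[l] + pre[r])[:k - 1]
            suf[i] = (suf[l] + suf[r])[-(k - 1):]
            t[i] = suf[l] + pre[r]

    def window(p):
        """S[p : min(p + k - 1, N)], read from the stabbing variable's t string."""
        e = min(p + k - 1, n_chars)
        j, start = root, 1
        while not isinstance(rules[j], str):
            l, r = rules[j]
            split = start + length[l]
            if e < split:
                j = l
            elif p >= split:
                j, start = r, split
            else:
                q = min(k - 1, length[l]) - (split - p)   # 0-based offset in t[j]
                return t[j][q:q + e - p + 1]
        return rules[j]                  # window of a single character

    trie, pairs, p = {}, [], 1           # trie: (node ID, character) -> child ID
    while p <= n_chars:
        w = window(p)
        node, parent, d = 0, 0, 0
        while d < len(w) and (node, w[d]) in trie:
            parent, node = node, trie[(node, w[d])]
            d += 1
        if d < len(w):                   # usual case: extend the trie by one node
            trie[(node, w[d])] = len(pairs) + 1
            pairs.append((node, w[d]))
            p += d + 1
        else:                            # unterminated text ended inside the trie
            pairs.append((parent, w[-1]))
            p += d
    return pairs


print(lz78_from_slp(RULES, 7))
```

On the SLP of Fig. 1 the pairs encode the factors a, ab, aba, abab, aa, b
listed in the caption of Fig. 3. By Lemma 1 a window of length
$\lfloor c_N \rfloor$ always suffices to find the mismatch that ends a factor,
which is why the walk never needs characters beyond the window.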

A better bound can be obtained by employing a simple doubling search on the 
length of partial decompressions. 
Corollary 1. Given an SLP of size $n$ representing a string $S$ of length $N$,
we can compute the LZ78 factorization of $S$ in $O(nL + m\log N)$ time and
$O(nL + m)$ space, where $m$ is the size of the LZ78 factorization, and $L$ is
the length of the longest LZ78 factor.

Proof. Instead of using $c_N$ for the length of partial decompressions, we start
from length 2. For some length $2^{i-1}$, if the LZ78 trie outgrows the suffix
tree and reaches a leaf, we rebuild the suffix tree and the embedded LZ78 trie
for length $2^i$ and continue with the factorization. This takes $O(n 2^i)$
time, and the total asymptotic complexity becomes
$n(2 + \cdots + 2^{\lceil \log_2 L \rceil}) = O(nL)$. Notice that the
$m \log N$ term does not increase, since the factorization itself is not
restarted, and also since the data structure of [5] is reused and only
constructed once. □

3.3 Reducing Partial Decompression 

By using the same techniques of [13], we can reduce the partial decompression
conducted on the SLP, and reduce the complexities of our algorithm. Let
$I = \{i \mid |X_i| \ge c_N\} \subseteq [1 : n]$. The technique exploits the
overlapping portions of each of the strings in $T_S$. The algorithm of [13]
shows how to construct, in time linear in its size, a trie of size
$(c_N - 1) + \sum_{i \in I}(|t_i| - (c_N - 1)) = N - \alpha = N_\alpha$
such that there is a one to one correspondence between a length $c_N$ path on
the trie and a length $c_N$ substring of a string in $T_S$. Here,

    $\alpha = \sum_{i \in I} \bigl((vOcc(X_i) - 1) \cdot (|t_i| - (c_N - 1))\bigr) \ge 0$    (1)

can be seen as a quantity which depends on the amount of redundancy that the
SLP captures with respect to length $c_N$ substrings.
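
The quantities in Equation (1) can be evaluated directly from $|X_i|$ and
$vOcc(X_i)$, without building any $t_i$. The sketch below does so for the SLP
of Fig. 1, under assumptions made only for the example: a hypothetical
dictionary encoding of the SLP, an integer window length `k` in place of $c_N$
(here $k = 4$), and illustrative function names. It only checks the identity
$N - \alpha = N_\alpha$ numerically; it does not construct the trie of [13].

```python
# Hypothetical encoding of the SLP of Fig. 1 (character or pair of indices).
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def redundancy(rules, root, k):
    """Return (N, alpha, N_alpha) for integer window length k >= 2."""
    order = sorted(rules)
    length, vocc = {}, {i: 0 for i in order}
    for i in order:                              # lengths, bottom-up
        rhs = rules[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    vocc[root] = 1
    for i in reversed(order):                    # occurrence counts, top-down
        if not isinstance(rules[i], str):
            for child in rules[i]:
                vocc[child] += vocc[i]
    alpha, trie_size = 0, k - 1
    for i in order:
        if isinstance(rules[i], str) or length[i] < k:
            continue                             # i is not in the index set I
        l, r = rules[i]
        t_len = min(k - 1, length[l]) + min(k - 1, length[r])   # |t_i|
        trie_size += t_len - (k - 1)
        alpha += (vocc[i] - 1) * (t_len - (k - 1))
    return length[root], alpha, trie_size


n_chars, alpha, n_alpha = redundancy(RULES, 7, 4)
print(n_chars, alpha, n_alpha)
```

Here only $X_5$ occurs more than once in the derivation tree, so it is the
only variable contributing to $\alpha$.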

Furthermore, a suffix tree of a trie can be constructed in linear time: 

Lemma 4 ([24]). Given a trie, the suffix tree for the trie can be constructed in
linear time and space.

The generalized suffix tree for $T_S$ used in our algorithm can be replaced with
the suffix tree of the trie, and we can reduce the $O(n\sqrt{N})$ term in the
complexity to $O(N_\alpha)$, thus obtaining an $O(N_\alpha + m\log N)$ time and
$O(N_\alpha + m)$ space algorithm. Since $N_\alpha$ is also bounded by
$O(n\sqrt{N})$, we obtain the following result:

Theorem 3. Given an SLP of size $n$ representing a string $S$ of length $N$, we
can compute the LZ78 factorization of $S$ in $O(N_\alpha + m\log N)$ time and
$O(N_\alpha + m)$ space, where $m$ is the size of the LZ78 factorization,
$N_\alpha = O(\min\{N - \alpha, n\sqrt{N}\})$, and $\alpha \ge 0$ is defined as
in Equation (1).

Since $m = O(N/\log_\sigma N)$ [29], our algorithms are asymptotically at least
as fast as a linear time algorithm which runs on the uncompressed string when
the alphabet size is constant. On the other hand, $N_\alpha$ can be much smaller
than $O(n\sqrt{N})$ when $vOcc(X_i) > 1$ for many of the variables. Thus our
algorithms can be faster when the text is compressible, i.e., $n$ and $m$ are
small.
3.4 Conversion from LZ77 Factorization to LZ78 Factorization

As a byproduct of the algorithm proposed above, we obtain an efficient algorithm
that converts a given LZ77 factorization [28] of a string to the corresponding
LZ78 factorization, without explicit decompression.

Definition 3 (LZ77 factorization). The LZ77-factorization of a string $S$ is
the factorization $f_1, \ldots, f_r$ of $S$ such that for every
$i = 1, \ldots, r$, factor $f_i$ is the longest prefix of $f_i \cdots f_r$ with
$f_i \in F_i$, where $F_i = Substr(f_1 \cdots f_{i-1}) \cup \Sigma$.

It is known that the LZ77-factorization of string $S$ can be efficiently
transformed into an SLP representing $S$.

Theorem 4 ([23]). Given the LZ77 factorization of size $r$ for a string $S$ of
length $N$, we can compute in $O(r \log N)$ time an SLP representing $S$, of
size $O(r \log N)$ and of height $O(\log N)$.

The following theorem is immediate from Corollary 1 and Theorem 4.

Theorem 5. Given the LZ77 factorization of size $r$ for a string $S$ of length
$N$, we can compute the LZ78 factorization for $S$ in
$O(rL \log N + m \log N)$ time and $O(rL \log N + m)$ space, where $m$ is the
size of the LZ78 factorization for $S$, and $L$ is the length of the longest
LZ78 factor.

It is also possible to improve the complexities of the above theorem using
Theorem 3, so that the conversion from LZ77 to LZ78 can be conducted in
$O(N_\alpha + m\log N)$ time and $O(N_\alpha + m)$ space, where $N_\alpha$ here
is defined for the SLP generated from the input LZ77 factorization. This is
significant since the resulting algorithm is at least as efficient as a naive
approach which requires decompression of the input LZ77 factorization, and can
be faster when the string is compressible.

4 Discussion 

We showed an efficient algorithm for calculating the LZ78 factorization of a
string $S$, from an arbitrary SLP of size $n$ which represents $S$. The
algorithm is guaranteed to be asymptotically at least as fast as a linear time
algorithm that runs on the uncompressed text, and can be much faster when $n$
and $m$ are small, i.e., the text is compressible.

It is easy to construct an SLP of size $O(m)$ that represents string $S$, given
its LZ78 factorization whose size is $m$ [16]. Thus, although it was not our
primary focus in this paper, the algorithms we have developed can be regarded
as a re-compression by LZ78, of strings represented as SLPs. The concept of
re-compression was recently used to speed up fully compressed pattern
matching [15]. We mention two other interesting potential applications of
re-compression, for which our algorithm provides solutions:
Maintaining Dynamic SLP Compressed Texts

Modifications to the SLP corresponding to edit operations on the string that it
represents (e.g.: character substitutions, insertions, deletions) can be
conducted in $O(h)$ time, where $h$ is the height of the SLP. However, these
modifications are ad-hoc, and there are no guarantees as to how compressed the
resulting SLP is, and repeated edit operations will inevitably cause degradation
of the compression ratio. By periodically re-compressing the SLP, we can
maintain the compressed size (w.r.t. LZ78) of the representation, without having
to explicitly decompress the entire string during the maintenance process.

Computing the NCD w.r.t. LZ78 without explicit decompression 

The Normalized Compression Distance (NCD) [6] measures the distance between
two data strings, based on a specific compression algorithm. It has been
shown to be effective for various clustering and classification tasks, while not
requiring in-depth prior knowledge of the data. NCD between two strings $S$ and
$T$ w.r.t. compression algorithm $A$ is determined by the values $C_A(ST)$,
$C_A(S)$, and $C_A(T)$, which respectively denote the sizes of the compressed
representation of strings $ST$, $S$, and $T$ when compressed by algorithm $A$.

When $S$ and $T$ are represented as SLPs, we can compute $C_{LZ78}(S)$ and
$C_{LZ78}(T)$ without explicitly decompressing all of $S$ and $T$, using the
algorithms in this paper. Furthermore, the SLP for the concatenation $ST$ can
be obtained by simply considering a new single variable and production rule
$X_{ST} \to X_S X_T$, where $X_S$ and $X_T$ are respectively the roots of the
SLPs which derive $S$ and $T$. Thus, by applying our algorithm on this SLP, we
can compute $C_{LZ78}(ST)$ without explicit decompression as well. Therefore it
is possible to compute NCD w.r.t. LZ78 between strings represented as SLPs, and
therefore even cluster or classify them, without explicit decompression.
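
For reference, the NCD is commonly defined as
$(C_A(ST) - \min\{C_A(S), C_A(T)\}) / \max\{C_A(S), C_A(T)\}$. The sketch below
evaluates this formula on plain, non-empty strings, with the number of LZ78
factors used as a stand-in for the compressed size; both this choice and the
function names are assumptions of the example. In the setting of this paper,
the three sizes would instead be obtained from the SLPs by the algorithms
above.

```python
def lz78_size(s):
    """Number of LZ78 factors of s, counting a trailing incomplete factor."""
    trie, node, count = {}, 0, 0       # trie: (node ID, character) -> child ID
    for c in s:
        child = trie.get((node, c))
        if child is not None:
            node = child
        else:
            count += 1
            trie[(node, c)] = count
            node = 0
    return count + (1 if node != 0 else 0)


def ncd(s, t):
    """Normalized Compression Distance of two non-empty strings w.r.t. LZ78."""
    cs, ct, cst = lz78_size(s), lz78_size(t), lz78_size(s + t)
    return (cst - min(cs, ct)) / max(cs, ct)


print(ncd("abab", "abab"))
```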